
Generative Verifiers: Reward Modeling as Next-Token Prediction

🌈 Abstract

The paper proposes Generative Verifiers (GenRM), which recast verification as next-token prediction in large language model (LLM) reasoning domains. Key points:

  • GenRM is a more performant alternative to discriminative reward models and unlocks inference-time techniques such as chain-of-thought reasoning and majority voting for better verification.
  • GenRM unifies generation and verification into a single LLM, and demonstrates that such unification benefits both generation and verification.
  • GenRM can effectively utilize synthetic model-generated rationales, which are noisy and sub-optimal, to identify reasoning errors in grade school math problems.

🙋 Q&A

[01] Comparing GenRM with Prior Verification Approaches

1. How does GenRM compare to standard discriminative verifiers and other approaches on reasoning tasks?

  • GenRM, which directly predicts a Yes/No token to verify a solution, matches or outperforms the discriminative reward model (RM) and other approaches such as LLM-as-a-Judge and self-consistency on algorithmic tasks like Last Letter Concatenation and Word Sorting, as well as on the GSM8K math reasoning task (a minimal scoring sketch follows this list).
  • GenRM-CoT, which combines chain-of-thought reasoning with majority voting, further improves over direct GenRM.
  • On GSM8K, GenRM-CoT consistently outperforms all other methods, even when using model-generated (rather than human-written) verification rationales.
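For concreteness, here is a minimal sketch of the direct GenRM scoring rule, assuming a Hugging Face causal LM; the checkpoint name, prompt wording, and token handling below are illustrative choices, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the paper finetunes Gemma-family models as verifiers.
MODEL_NAME = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def genrm_score(question: str, solution: str) -> float:
    """Direct GenRM: score a candidate solution as the probability that the
    verifier's next token is 'Yes' rather than 'No'."""
    prompt = (
        f"Question: {question}\n"
        f"Solution: {solution}\n"
        "Is the answer correct (Yes/No)? "
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    # The exact token ids for 'Yes'/'No' depend on the tokenizer (e.g. leading space).
    yes_id = tokenizer.convert_tokens_to_ids("Yes")
    no_id = tokenizer.convert_tokens_to_ids("No")
    # Normalize over the two verdict tokens so the score lies in [0, 1].
    return float(probs[yes_id] / (probs[yes_id] + probs[no_id]))
```

A score above 0.5 can be read as a "correct" verdict; for reranking candidate solutions, the raw score is used directly.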

2. How does GenRM's use of chain-of-thought reasoning and majority voting impact its performance?

  • With oracle verification CoTs, GenRM-CoT closely matches the performance of an oracle verifier on the algorithmic tasks.
  • On GSM8K, GenRM-CoT is able to detect subtle reasoning errors that are missed by discriminative verifiers, by leveraging the chain-of-thought rationales.
  • Majority voting across multiple CoT rationales generated by GenRM-CoT further boosts its accuracy, allowing it to nearly match an oracle verifier on the algorithmic tasks (see the voting sketch after this list).
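Below is a hedged sketch of the CoT variant with majority voting, reusing the `model` and `tokenizer` objects from the previous sketch; the prompt wording, sampling temperature, and default number of votes are assumptions.

```python
def genrm_cot_score(question: str, solution: str, num_votes: int = 4) -> float:
    """GenRM-CoT with majority voting: sample several verification rationales,
    score p('Yes') after each one, and average the scores."""
    prompt = (
        f"Question: {question}\n"
        f"Solution: {solution}\n"
        "Let's verify the solution step by step.\n"
    )
    yes_id = tokenizer.convert_tokens_to_ids("Yes")
    no_id = tokenizer.convert_tokens_to_ids("No")
    scores = []
    for _ in range(num_votes):
        # Sample one chain-of-thought verification rationale.
        inputs = tokenizer(prompt, return_tensors="pt")
        generated = model.generate(
            **inputs, do_sample=True, temperature=0.7, max_new_tokens=256
        )
        rationale = tokenizer.decode(generated[0], skip_special_tokens=True)
        # Re-score the Yes/No verdict conditioned on the sampled rationale.
        verdict_prompt = rationale + "\nIs the answer correct (Yes/No)? "
        verdict_inputs = tokenizer(verdict_prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**verdict_inputs).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        scores.append(float(probs[yes_id] / (probs[yes_id] + probs[no_id])))
    # Majority voting amounts to averaging the per-rationale 'Yes' probabilities.
    return sum(scores) / len(scores)
```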

[02] Unifying Generation and Verification

1. How does unifying solution generation with verification impact GenRM's performance?

  • Unifying solution generation with verification, as GenRM does through a single next-token-prediction objective, consistently improves verification performance across all tasks compared to training GenRM on verification data alone (a schematic of the unified objective follows this list).
  • Conversely, adding CoT verification data to the training mix also improves the solution-generation performance of the resulting GenRM-CoT model itself.
  • This suggests that teaching the verifier to imitate correct solutions through next-token prediction is mutually beneficial for both generation and verification.
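Schematically, the unified objective is plain next-token prediction on verification targets plus a weighted next-token-prediction term on known-correct solutions; the notation below, including the mixture weight lambda, is ours and is only meant to convey the idea, not the paper's exact formulation.

```latex
\mathcal{L}(\theta) =
  \underbrace{\mathbb{E}_{(x,\,v)\sim\mathcal{D}_{\mathrm{verify}}}
    \bigl[-\log p_\theta(v \mid x)\bigr]}_{\text{verification: Yes/No, optionally preceded by a CoT rationale}}
  \;+\; \lambda\,
  \underbrace{\mathbb{E}_{(q,\,s^{+})\sim\mathcal{D}_{\mathrm{correct}}}
    \bigl[-\log p_\theta(s^{+} \mid q)\bigr]}_{\text{generation: SFT on correct solutions}}
```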

[03] Scaling Data, Model Size, and Inference-time Compute

1. How does GenRM-CoT's performance scale with increased inference-time compute?

  • GenRM-CoT's performance scales gracefully with the number of CoT rationales sampled for majority voting, surpassing greedy decoding with as few as 4 votes.
  • Across different Gemma model scales (2B, 7B, 9B), the finetuned GenRM-CoT verifier outperforms the LLM-as-a-Judge approach, which also utilizes CoT and majority voting but with a more capable Gemini 1.0 Pro model.

2. How does GenRM's performance scale with increasing model size and training data?

  • The performance of GenRM and GenRM-CoT verifiers scales positively with an increase in Gemma model capacity, matching the expectation that larger models can learn more from the same data under the next-token prediction loss.
  • For GenRM-CoT on GSM8K, training on multiple rationales per solution has a substantial positive effect on both RM accuracy and Best-of-N performance, suggesting the model benefits from an "ensembling" effect when trained on noisy synthetic rationales (a Best-of-N selection sketch follows this list).
  • Direct GenRM verifiers trained only on verification data still outperform standard discriminative RMs as the amount of training data increases, demonstrating the effectiveness of casting verification as a next-token prediction problem.
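To connect the verifier to the Best-of-N results above, here is a minimal selection sketch; `score_fn` can be either scoring function sketched earlier, and the candidate solutions are assumed to come from a separately sampled generator.

```python
def best_of_n(question: str, candidates: list[str], score_fn) -> str:
    """Best-of-N selection: score every sampled candidate solution with the
    verifier and return the highest-scoring one."""
    scores = [score_fn(question, solution) for solution in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]
```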

[04] Impact of Synthetic Rationale Quality

1. How does the quality of synthetic rationales impact GenRM-CoT's performance on GSM8K?

  • Using reference-guided grading to generate the synthetic rationales significantly improves GenRM-CoT's performance on GSM8K compared to using unguided synthetic rationales (an illustrative prompt sketch follows this list).
  • This indicates that LLMs are better able to identify reasoning errors when provided with a reference solution for comparison, even when using the same model (Gemini 1.0 Pro) to generate both the solutions and rationales.
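An illustrative reference-guided grading prompt (the wording is ours, not the paper's exact template): the rationale-generating LLM sees a known-correct reference solution alongside the candidate it must check.

```python
def reference_guided_prompt(question: str, reference_solution: str,
                            candidate_solution: str) -> str:
    """Build a reference-guided grading prompt for synthetic rationale generation."""
    return (
        f"Question: {question}\n\n"
        f"Reference solution (known to be correct):\n{reference_solution}\n\n"
        f"Candidate solution to verify:\n{candidate_solution}\n\n"
        "Compare the candidate to the reference step by step, point out any "
        "reasoning errors in the candidate, and finish with "
        "'Is the answer correct (Yes/No)?' followed by the verdict."
    )
```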
